TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
We present a novel model for Tracking Any Point (TAP) that effectively tracks
any queried point on any physical surface throughout a video sequence. Our
approach employs two stages: (1) a matching stage, which independently locates
a suitable candidate point match for the query point on every other frame, and
(2) a refinement stage, which updates both the trajectory and query features
based on local correlations. The resulting model surpasses all baseline methods
by a significant margin on the TAP-Vid benchmark, as demonstrated by an
approximately 20% absolute improvement in average Jaccard (AJ) on DAVIS. Our model
facilitates fast inference on long and high-resolution video sequences. On a
modern GPU, our implementation has the capacity to track points faster than
real-time, and can be flexibly extended to higher-resolution videos. Given the
high-quality trajectories extracted from a large dataset, we demonstrate a
proof-of-concept diffusion model which generates trajectories from static
images, enabling plausible animations. Visualizations, source code, and
pretrained models can be found on our project webpage.
Comment: Published at ICCV 202
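
The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes precomputed per-frame feature maps stored as nested lists, uses a hard argmax over dot-product scores as a stand-in for the matching stage, and uses a simple averaging rule as a hypothetical stand-in for the query-feature update. The names `match_per_frame`, `refine`, `radius`, and `steps` are invented for illustration only.

```python
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as fmap[y][x] -> feature vector


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_per_frame(query: Vector, frames: List[FeatureMap]) -> List[Tuple[int, int]]:
    """Stage 1 (simplified): on every frame independently, pick the
    location whose feature correlates most strongly with the query."""
    track = []
    for fmap in frames:
        best, best_xy = float("-inf"), (0, 0)
        for y, row in enumerate(fmap):
            for x, feat in enumerate(row):
                score = dot(query, feat)
                if score > best:
                    best, best_xy = score, (x, y)
        track.append(best_xy)
    return track


def refine(query: Vector, frames: List[FeatureMap],
           track: List[Tuple[int, int]], radius: int = 1, steps: int = 2):
    """Stage 2 (simplified): alternately update the query feature and the
    trajectory, scoring only a small window around each current estimate."""
    q, track = list(query), list(track)
    for _ in range(steps):
        # Hypothetical query update (an assumption, not the paper's rule):
        # blend the query with the mean feature along the current track.
        matched = [frames[t][y][x] for t, (x, y) in enumerate(track)]
        mean = [sum(col) / len(matched) for col in zip(*matched)]
        q = [0.5 * a + 0.5 * b for a, b in zip(q, mean)]

        # Local correlation: search a (2*radius+1)^2 window per frame.
        new_track = []
        for t, (x0, y0) in enumerate(track):
            fmap = frames[t]
            h, w = len(fmap), len(fmap[0])
            best, best_xy = float("-inf"), (x0, y0)
            for y in range(max(0, y0 - radius), min(h, y0 + radius + 1)):
                for x in range(max(0, x0 - radius), min(w, x0 + radius + 1)):
                    score = dot(q, fmap[y][x])
                    if score > best:
                        best, best_xy = score, (x, y)
            new_track.append(best_xy)
        track = new_track
    return track, q
```

Calling `match_per_frame` yields one `(x, y)` estimate per frame, and `refine` then adjusts those estimates using only nearby features. A real model would presumably replace both the hard argmax and the averaging rule with learned components.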